This project builds a logistic regression classification model that uses diabetic patient features to predict whether the patient will be readmitted to the hospital.
The predictive question we aim to answer is: given a diabetic patient’s demographics, medication history, and diabetes management during a hospital stay, can we predict whether they will be readmitted to the hospital?
Due to Covid-19, it is critical to reduce the burden on the healthcare system and keep readmission rates from rising so that capacity remains for Covid cases. Our predictor looks at diabetes management and diagnosis during a patient’s hospital stay to understand how much these affect readmission. Analysis with machine learning models will identify the features most predictive of patient readmission. This will allow us to create and improve patient safety protocols that better manage diabetic patients during their hospital stay, providing effective care and preventing readmission during this critical time.
The R programming language (R Core Team 2020) and Python programming language (Van Rossum and Drake 2009) were used to perform the analysis. The following R and Python packages were also used to perform the analysis:
For statistical analysis (SCRIPT4) specifically:
The code used to perform the analysis and create this report can be found here.
(“Insulin, Medicines, & Other Diabetes Treatments” 2016) (2019)
The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058, and a recipient of the CERNER data. This dataset was collected from 1998 to 2008 across 130 hospitals and integrated delivery networks throughout the United States of America.
This data set was sourced from the UCI Machine Learning Repository (Strack et al. 2014a) and can be found here. Research based on these data assessed diabetic care during hospitalization and whether patients were likely to be readmitted. The paper by Strack et al. (2014b) can be found here. Each row corresponds to a unique encounter with a diabetic patient, totaling 74,036,643 unique encounters. Details about each column feature collected during these encounters can be found here.
After data cleaning, a logistic regression model was compared against a support vector machine with a radial basis function kernel (RBF SVM) and a baseline dummy classifier. Logistic regression was selected as the best model based on fit time, score time, accuracy, and F1 score. We then optimized its hyperparameters and used the tuned model to predict diabetic patient readmission (the readmitted target column of the data set). The code used to perform the analysis and create this report can be found [here](INSERT_URL_TO_SCRIPT_4).
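The comparison and tuning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes scikit-learn, uses a synthetic data set in place of the cleaned patient data, and picks `RandomizedSearchCV` over the regularization strength `C` as one plausible way to optimize hyperparameters (the report does not state which search method or parameters were used).

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned data (assumption: numeric features, binary target).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.6, 0.4], random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# The three candidates named in the text: baseline, logistic regression, RBF SVM.
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "rbf_svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Cross-validate each model and record mean fit time, score time, accuracy, and F1.
rows = {}
for name, model in models.items():
    cv_results = cross_validate(
        model, X_train, y_train, cv=5, scoring=["accuracy", "f1"]
    )
    rows[name] = {key: values.mean() for key, values in cv_results.items()}
comparison = pd.DataFrame(rows).T
print(comparison)

# Tune the logistic regression pipeline (assumed search space: C on a log scale).
search = RandomizedSearchCV(
    models["logistic_regression"],
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e3)},
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=123,
)
search.fit(X_train, y_train)

# Score the tuned model on the held-out split using the same F1 scorer.
test_f1 = search.score(X_test, y_test)
print(search.best_params_, test_f1)
```

The `comparison` table has one row per model and columns `fit_time`, `score_time`, `test_accuracy`, and `test_f1`, which mirrors the criteria the report lists for choosing logistic regression.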
Through exploratory data analysis, we determined that some features were uninformative for our question or contained many missing values. A Pandas Profiling report, which can be found here, confirmed this and also revealed correlations between specific features and a potential class imbalance in the readmitted target column. The correlations between certain numerical features shown in the Pandas Profiling report were confirmed when we analyzed interactions between the features.
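The checks mentioned above (missing values, class balance, and correlation between numerical features) can also be reproduced by hand with pandas. The sketch below uses a small made-up data frame: apart from `readmitted`, the column names, the values, the two-level target, and the 50% missingness threshold are all illustrative assumptions rather than details taken from the project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data; all values here are invented for illustration.
df = pd.DataFrame(
    {
        "time_in_hospital": [1, 3, 5, 7, 2, 9],
        "num_medications": [5, 11, 16, 22, 8, 30],
        "weight": [np.nan, np.nan, 80.0, np.nan, np.nan, np.nan],
        "readmitted": ["NO", "NO", "YES", "NO", "YES", "NO"],
    }
)

# Fraction of missing values per column, highest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

# Flag columns that are mostly missing (hypothetical threshold of 50%).
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()

# Relative frequency of each target class, to check for class imbalance.
class_balance = df["readmitted"].value_counts(normalize=True)

# Pairwise Pearson correlation between the remaining numerical features.
correlations = df.drop(columns=mostly_missing).select_dtypes("number").corr()

print(missing_fraction, class_balance, correlations, sep="\n\n")
```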